Loading and Saving Data in R

Author

Martin Schweinberger

Introduction

Prerequisite Tutorials

Before working through this tutorial, you should be familiar with the content of the following:

  • Getting Started with R

If you are new to R, please work through Getting Started with R before proceeding.

Learning Objectives

By the end of this tutorial you will be able to:

  1. Load tabular data from plain text (.csv, .tsv, .txt), Excel (.xlsx), R-native (.rda, .rds), JSON, and XML formats into R
  2. Save R data objects back to each of those formats using appropriate functions
  3. Load data directly from a URL without downloading it manually
  4. Access built-in datasets from base R and installed R packages
  5. Load a single plain-text file and a directory of multiple text files into R for corpus analysis
  6. Read text from Microsoft Word (.docx) files using the officer package
Citation

Schweinberger, Martin. 2026. Loading and Saving Data in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/load/load.html (Version 2026.05.01).

This tutorial covers two foundational data-management skills for linguistic research in R: loading data from a wide variety of file formats into your R session, and saving processed data and R objects back to disk in appropriate formats.

Data rarely arrive in a single tidy format. A corpus might be spread across hundreds of plain-text files; an experimental dataset might come from a collaborator as an Excel spreadsheet; a frequency list might be stored as an R object from a previous session; metadata might be embedded in a JSON file exported from a web API; and survey responses might be in an SPSS .sav file. Knowing how to read and write data in R is therefore not a preliminary skill to be rushed through — it is a core competency that affects every subsequent step of your analysis.

The tutorial is aimed at beginners to intermediate R users. It assumes you are comfortable with basic R syntax (objects, functions, vectors, data frames) but have no prior experience with the specific packages used here.

Need to Generate Data from Scratch?

If you do not have real data yet and want to create synthetic datasets for method development, teaching, or power analysis, see the companion tutorial: Simulating Data with R.


Project Structure and File Paths

Section Overview

What you will learn: How to set up a reproducible project directory, why the here package is preferred over setwd(), and how to verify that R can find your data files before you try to load them

Why File Paths Matter

Every data-loading command in R requires a file path — the address of the file on your computer (or on the web). Paths that work on your computer will break when you share your script with a colleague, upload it to a server, or move your project to a different folder. The most common source of beginner frustration (“it worked yesterday!”) is a broken file path.

There are two approaches to managing paths: the fragile one and the robust one.

The fragile approach — setwd(): Setting the working directory with setwd("C:/Users/Martin/Documents/myproject") hard-codes an absolute path that is specific to one machine and one folder location. As soon as you move the project, rename a folder, or share the code, it breaks.

The robust approach — RStudio Projects + here: Creating an RStudio Project (.Rproj file) anchors all paths to the project root. The here package then builds paths relative to that root using here::here(), which works identically on Windows, macOS, and Linux regardless of where the project folder lives.

Verifying Paths with here

Code
library(here)

# Check what here() considers the project root
here::here()

# Build a path to a file in the data subfolder
here::here("data", "testdat.csv")

# Check whether the file actually exists at that path
file.exists(here::here("data", "testdat.csv"))

# List all files in the data folder
list.files(here::here("data"))

# List all .txt files in the testcorpus subfolder
list.files(here::here("data", "testcorpus"), pattern = "\\.txt$")
Always Check Before Loading

Run file.exists(your_path) before attempting to load a file. If it returns FALSE, diagnose the problem with list.files() before debugging your loading code — the file path is almost always the issue, not the loading function.
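This check-then-load pattern can be sketched as follows. The sketch creates a small file in a temporary folder so it is self-contained; in practice you would substitute your own here::here() path:

```r
# Self-contained sketch: create a small CSV in a temporary folder
path <- file.path(tempdir(), "testdat.csv")
write.csv(data.frame(Variable1 = 1:3, Variable2 = 4:6),
          path, row.names = FALSE)

# Check the path before loading; on failure, list.files() shows what
# is actually in the folder (typos and wrong subfolders show up here)
if (file.exists(path)) {
  dat <- read.csv(path)
} else {
  stop("File not found. Folder contains: ",
       paste(list.files(dirname(path)), collapse = ", "))
}
nrow(dat)
```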


Setup

Installing Packages

Code
# Run once — comment out after installation
install.packages("here")        # robust file paths
install.packages("readr")       # fast CSV/TSV reading (tidyverse)
install.packages("openxlsx")    # read and write Excel files
install.packages("readxl")      # read Excel files (tidyverse)
install.packages("writexl")     # write Excel files (lightweight)
install.packages("jsonlite")    # parse and write JSON
install.packages("xml2")        # parse and write XML
install.packages("haven")       # SPSS, Stata, SAS files
install.packages("dplyr")       # data manipulation
install.packages("tidyr")       # data reshaping
install.packages("stringr")     # string manipulation
install.packages("purrr")       # functional programming (map/walk)
install.packages("ggplot2")     # visualisation
install.packages("officer")     # read Word documents

Loading Packages

Code
library(here)
library(readr)
library(openxlsx)
library(readxl)
library(writexl)
library(jsonlite)
library(xml2)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(ggplot2)
library(officer)

Loading and Saving Plain Text Data

Section Overview

What you will learn: How to load and save tabular plain-text files (CSV, TSV, delimited TXT) using both base R functions and the faster, more consistent readr package; how to diagnose common loading problems; and when to choose each approach

What Is a Plain-Text Tabular File?

A plain-text tabular file stores a data table as human-readable text, with columns separated by a special character called the delimiter. The most common delimiters are:

Common plain-text tabular formats
Format Delimiter File extension Notes
CSV Comma (,) .csv Most common; problematic when data values themselves contain commas
TSV Tab (\t) .tsv or .txt Safer for text data; less widely used
Semi-colon delimited Semi-colon (;) .csv Common in European locales where , is the decimal separator
Pipe delimited Pipe (|) .txt Used in some corpus annotation formats
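The delimiter determines how each line is split into columns. To see this without creating any files, the same two-row table can be parsed from in-memory strings (read.csv() and read.delim() accept a text argument in place of a file path):

```r
# The same two-row table, serialised with three different delimiters
csv_text  <- "Variable1,Variable2\n6,67\n65,16"
tsv_text  <- "Variable1\tVariable2\n6\t67\n65\t16"
pipe_text <- "Variable1|Variable2\n6|67\n65|16"

d_csv  <- read.csv(text = csv_text)
d_tsv  <- read.delim(text = tsv_text)             # sep = "\t" is the default
d_pipe <- read.delim(text = pipe_text, sep = "|")

# All three parse to the same data frame
identical(d_csv, d_tsv) && identical(d_tsv, d_pipe)
```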

Loading CSV Files

Base R: read.csv()

The base R function read.csv() is available without loading any packages and is the default choice for many users:

Code
# Base R CSV loading
datcsv <- read.csv(
  here::here("tutorials/load/data", "testdat.csv"),
  header      = TRUE,    # first row = column names (default TRUE)
  strip.white = TRUE,    # trim leading/trailing whitespace from strings
  na.strings  = c("", "NA", "N/A", "missing")  # treat these as NA
)

# Inspect structure
str(datcsv)
'data.frame':   10 obs. of  2 variables:
 $ Variable1: int  6 65 12 56 45 84 38 46 64 24
 $ Variable2: int  67 16 56 34 54 42 36 47 54 29
Code
head(datcsv)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Key Arguments for read.csv()
Key arguments for read.csv()
Argument Default Purpose
header TRUE First row contains column names
sep "," Column delimiter
dec "." Decimal separator
na.strings "NA" Strings to treat as missing
strip.white FALSE Strip whitespace from string fields
encoding "unknown" File encoding (try "UTF-8" for non-ASCII text)
comment.char "" Ignore lines starting with this character
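Several of these arguments can be seen together on a small in-memory example (hypothetical data; the text argument stands in for a file path):

```r
# A messy export: a comment line, stray whitespace, an ad-hoc missing code
messy <- "# exported 2026-01-01\nVariable1,Variable2\n6, 67\nmissing,16"

dat <- read.csv(
  text         = messy,
  comment.char = "#",                      # skip the comment line
  strip.white  = TRUE,                     # trim " 67" to "67"
  na.strings   = c("", "NA", "missing")    # treat "missing" as NA
)

dat$Variable1   # NA in the second row
```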

The readr Package: read_csv()

The readr package (part of the tidyverse) provides faster, more consistent alternatives to base R reading functions. Key advantages: it returns a tibble rather than a plain data frame, it prints progress for large files, it guesses column types more reliably, and it produces informative error messages.

Code
# readr CSV loading
datcsv_r <- readr::read_csv(
  here::here("tutorials/load/data", "testdat.csv"),
  col_types  = cols(),      # suppress type-guessing messages
  na         = c("", "NA", "N/A"),
  trim_ws    = TRUE
)

# readr always prints a column specification — inspect it
spec(datcsv_r)
cols(
  Variable1 = col_double(),
  Variable2 = col_double()
)
Code
head(datcsv_r)
# A tibble: 6 × 2
  Variable1 Variable2
      <dbl>     <dbl>
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
read.csv() vs. read_csv(): Which Should I Use?

Use read.csv() when you need no extra dependencies, are working with small files, or are writing a script that others will run without the tidyverse installed.

Use read_csv() when working with large files (it is 5–10× faster), when you want explicit column-type checking, or when your workflow uses tidyverse throughout. The underscore vs. dot distinction is the only naming difference to remember: read.csv() is base R, read_csv() is readr.

Semi-Colon Delimited CSV

In many European locales the comma is the decimal separator (e.g. 3,14 for π), so CSV files from these locales use a semi-colon as the column delimiter. Both base R and readr provide specialised functions:

Code
# Base R: read.delim with sep = ";"
datcsv2_base <- read.delim(
  here::here("tutorials/load/data", "testdat2.csv"),
  sep    = ";",
  header = TRUE,
  dec    = ","   # comma as decimal separator
)

# readr: read_csv2() handles ; delimiter and , decimal automatically
datcsv2_r <- readr::read_csv2(
  here::here("tutorials/load/data", "testdat2.csv"),
  col_types = cols()
)

head(datcsv2_base)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Loading TSV and Other Delimited Files

Code
# readr: read_tsv for tab-separated files
# dattxt_r <- readr::read_tsv(
#   here::here("tutorials/load/data", "testdat.txt"),
#   col_types = cols()
# )

# readr: read_delim for any delimiter
# datpipe <- readr::read_delim(
#   here::here("tutorials/load/data", "testdat_pipe.txt"),
#   delim     = "|",
#   col_types = cols()
# )

# Base R equivalent
dattxt_base <- read.delim(
  here::here("tutorials/load/data", "testdat.txt"),
  sep    = "\t",
  header = TRUE
)
head(dattxt_base)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Saving Plain-Text Files

Writing CSV

Code
# Base R: write.csv — adds row numbers by default; suppress with row.names = FALSE
write.csv(
  datcsv,
  file         = here::here("tutorials/load/data", "testdat_out.csv"),
  row.names    = FALSE,   # ALWAYS set this to avoid a spurious row-number column
  fileEncoding = "UTF-8"
)

# readr: write_csv — no row names by default; faster; always UTF-8
readr::write_csv(
  datcsv_r,
  file = here::here("tutorials/load/data", "testdat_out_r.csv")
)

# Semi-colon CSV (European locale)
readr::write_csv2(
  datcsv2_r,
  file = here::here("tutorials/load/data", "testdat2_out.csv")
)
Always Use row.names = FALSE

The base R write.csv() adds a column of row numbers by default (row names). This creates an unnamed first column of integers when the file is re-read, which is almost never what you want. Always set row.names = FALSE when using write.csv(). The readr functions (write_csv, write_tsv) never write row names.
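The round-trip problem is easy to demonstrate with a temporary file (self-contained sketch):

```r
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(Variable1 = c(6, 65), Variable2 = c(67, 16))

# Default write.csv(): row names become an unnamed first column,
# which read.csv() then labels "X"
write.csv(df, tmp)
n_default <- names(read.csv(tmp))

# With row.names = FALSE the round trip is clean
write.csv(df, tmp, row.names = FALSE)
n_clean <- names(read.csv(tmp))

n_default   # "X" "Variable1" "Variable2"
n_clean     # "Variable1" "Variable2"
unlink(tmp)
```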

Writing TSV and Other Formats

Code
# TSV
readr::write_tsv(
  datcsv_r,
  file = here::here("tutorials/load/data", "testdat_out.tsv")
)

# Custom delimiter (pipe)
readr::write_delim(
  datcsv_r,
  file  = here::here("tutorials/load/data", "testdat_out_pipe.txt"),
  delim = "|"
)

# Base R: write.table (most flexible)
write.table(
  datcsv,
  file      = here::here("tutorials/load/data", "testdat_out.txt"),
  sep       = "\t",
  row.names = FALSE,
  quote     = FALSE    # suppress quoting of strings (useful for corpus data)
)

You receive a file called responses.csv from a colleague in Germany. When you load it with read.csv(), all numeric columns appear as character strings and one column called Score shows values like "3,14" and "2,71". What is the most likely problem, and how do you fix it?

  a) The file is corrupt — ask the colleague to re-export it
  b) The file uses a semi-colon as the column delimiter and a comma as the decimal separator — use read.csv2() or read.delim(sep = ";", dec = ",")
  c) The file is tab-separated, not comma-separated — use read.delim(sep = "\t")
  d) The Score column contains text responses — convert manually with as.numeric()
Answer

b) The file uses a semi-colon as the column delimiter and a comma as the decimal separator — use read.csv2() or read.delim(sep = ";", dec = ",")

German locale settings use , as the decimal mark (so 3,14 means 3.14) and ; as the CSV column delimiter (so that commas in numbers are not confused with column separators). When you read such a file with read.csv() (which expects , as the delimiter), the entire row is read as one column, and numbers appear as strings. The fix is read.csv2() (base R) or readr::read_csv2(), both of which default to ; delimiter and , decimal. Option (d) would treat the symptom, not the cause.
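You can reproduce both the symptom and the fix without a file, using hypothetical German-locale data passed via the text argument:

```r
# Semi-colon delimiter, comma decimal: typical German-locale export
de_text <- "Variable1;Score\n1;3,14\n2;2,71"

wrong <- read.csv(text = de_text)    # expects "," as the delimiter
right <- read.csv2(text = de_text)   # expects ";" delimiter, "," decimal

ncol(wrong)    # 1: each row collapses into a single column
right$Score    # numeric: 3.14 2.71
```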


Loading and Saving Excel Files

Section Overview

What you will learn: How to read and write .xlsx and .xls Excel files using readxl, openxlsx, and writexl; how to work with multi-sheet workbooks; and common pitfalls of Excel data (merged cells, date encoding, mixed-type columns)

Why Excel Handling Deserves Its Own Section

Excel is the most widely used data format outside of programming environments, and linguistic researchers constantly receive data from collaborators, transcription tools, survey platforms, and corpus annotation software in .xlsx format. However, Excel files present challenges that plain-text files do not:

  • Multiple sheets in a single file, only one of which contains the data you need
  • Merged cells and complex headers that break rectangular data assumptions
  • Mixed-type columns where Excel has inferred numeric types for columns that should be character
  • Date columns that Excel stores as integers (days since 1900) and that R must convert
  • Trailing whitespace and invisible characters copied from other software

Loading Excel Files

The readxl Package

readxl is the tidyverse-standard Excel reader. It reads both .xlsx and the older .xls format, has no Java dependency (unlike xlsx), and returns a tibble.

Code
# List all sheets in the workbook before loading
readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx"))
[1] "Sheet 1"
Code
# Load the first sheet
datxlsx <- readxl::read_excel(
  path      = here::here("tutorials/load/data", "testdat.xlsx"),
  sheet     = 1,          # sheet number or name
  col_names = TRUE,       # first row = column names
  na        = c("", "NA", "N/A"),
  trim_ws   = TRUE,
  skip      = 0           # number of rows to skip before reading
)

str(datxlsx)
tibble [10 × 2] (S3: tbl_df/tbl/data.frame)
 $ Variable1: num [1:10] 6 65 12 56 45 84 38 46 64 24
 $ Variable2: num [1:10] 67 16 56 34 54 42 36 47 54 29
Code
head(datxlsx)
# A tibble: 6 × 2
  Variable1 Variable2
      <dbl>     <dbl>
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Code
# Load all sheets at once into a named list
all_sheets <- purrr::map(
  readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx")),
  ~ readxl::read_excel(
      path  = here::here("tutorials/load/data", "testdat.xlsx"),
      sheet = .x,
      na    = c("", "NA")
  )
) |>
  purrr::set_names(
    readxl::excel_sheets(here::here("tutorials/load/data", "testdat.xlsx"))
  )

# Access individual sheets by name
# all_sheets[["Sheet 1"]]
Specifying Column Types in read_excel()

Excel sometimes guesses column types incorrectly. Use the col_types argument to override:

readxl::read_excel(
  path      = here::here("data", "testdat.xlsx"),
  col_types = c("text", "numeric", "date", "logical")
)

Valid types are "skip", "guess", "logical", "numeric", "date", "text", and "list". Use "text" for ID columns or any column that should never be converted to a number.

The openxlsx Package

openxlsx is the most feature-complete Excel package for R. It can read, write, and format .xlsx files (cell colours, fonts, borders, conditional formatting), which makes it the best choice when your output needs to be presentable as a report.

Code
# Load with openxlsx
datxlsx2 <- openxlsx::read.xlsx(
  xlsxFile   = here::here("tutorials/load/data", "testdat.xlsx"),
  sheet      = 1,
  colNames   = TRUE,
  na.strings = c("", "NA")
)

head(datxlsx2)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42

Saving Excel Files

Simple Saving with writexl

writexl has no dependencies and writes clean .xlsx files extremely fast. Use it whenever you only need to export a data frame without formatting:

Code
writexl::write_xlsx(
  x    = datxlsx,
  path = here::here("tutorials/load/data", "testdat_out.xlsx")
)

# Write multiple sheets: pass a named list
writexl::write_xlsx(
  x    = list(RawData = datcsv, Processed = datxlsx),
  path = here::here("tutorials/load/data", "multisheet_out.xlsx")
)

Formatted Saving with openxlsx

Code
# Simple write
openxlsx::write.xlsx(
  x    = datxlsx2,
  file = here::here("tutorials/load/data", "testdat_openxlsx.xlsx")
)

# Formatted workbook: create, style, save
wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, sheetName = "Results")
openxlsx::writeData(wb, sheet = "Results", x = datxlsx2, startRow = 1, startCol = 1)

# Style the header row
header_style <- openxlsx::createStyle(
  fontColour     = "#FFFFFF",
  fgFill         = "#4472C4",
  halign         = "center",
  textDecoration = "bold",
  border         = "Bottom"
)
openxlsx::addStyle(wb, sheet = "Results", style = header_style,
                   rows = 1, cols = 1:ncol(datxlsx2), gridExpand = TRUE)

# Freeze the top row (useful for large tables)
openxlsx::freezePane(wb, sheet = "Results", firstRow = TRUE)

openxlsx::saveWorkbook(wb,
  file      = here::here("tutorials/load/data", "testdat_formatted.xlsx"),
  overwrite = TRUE
)
Common Excel Pitfalls

Date columns: Excel stores dates as integers (days since 1 January 1900). readxl converts these automatically; openxlsx::read.xlsx() may return them as integers unless you set detectDates = TRUE.

Leading zeros: Excel silently drops leading zeros from numeric-looking strings (e.g. zip codes "01234" become 1234). Protect them with col_types = "text" in read_excel().

Merged cells: Merged cells create NA values in all but the first row of the merge. Use tidyr::fill() to propagate values downward after loading.

Formula cells: By default, readxl reads the cached formula result, not the formula itself. This is almost always what you want.
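The merged-cell repair with tidyr::fill() looks like this (a sketch with hypothetical genre labels standing in for freshly loaded Excel data):

```r
library(tidyr)

# After loading, a merged "genre" cell leaves NA below its first row
dat <- data.frame(
  genre   = c("academic", NA, NA, "fiction", NA),
  text_id = c("T001", "T002", "T003", "T004", "T005")
)

# Propagate the last non-missing value downward
filled <- tidyr::fill(dat, genre, .direction = "down")
filled$genre   # "academic" "academic" "academic" "fiction" "fiction"
```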

You load an Excel file containing participant IDs such as "007", "012", "099". After loading with read_excel() you notice they appear as 7, 12, 99 — the leading zeros are gone. What is the most reliable fix?

  a) Re-type the IDs manually in R with paste0("0", dat$ID)
  b) Specify col_types = "text" for the ID column in read_excel() so R reads it as a character string without numeric coercion
  c) Open the file in Excel and format the column as “Text” before loading into R
  d) Use formatC(dat$ID, width = 3, flag = "0") to add zeros back after loading
Answer

b) Specify col_types = "text" for the ID column in read_excel() so R reads it as a character string without numeric coercion

This is the most reliable solution because it prevents the coercion from happening in the first place. Option (c) also works but requires manual intervention each time the file is updated. Option (d) partially fixes the symptom but fails if IDs have different lengths. Option (a) assumes all IDs are exactly 3 digits and only adds one zero, which is incorrect for "007". The best practice is always to protect ID columns and any column with leading-zero strings by specifying col_types = "text".


Loading and Saving R Native Formats

Section Overview

What you will learn: The difference between .rds, .rda / .RData, and workspace saves; when to use each; and best practices for long-term storage of R objects

R Native Formats at a Glance

R has several native serialisation formats. Understanding the differences matters for reproducibility and collaboration:

R native formats compared
Format Extension Stores Load function Save function
RDS .rds One R object readRDS() saveRDS()
RData .rda or .RData One or more named objects load() save()
Workspace .RData All objects in the environment load() (also auto-loaded at startup) save.image()
Prefer .rds Over .RData for Data Exchange

When sharing a single dataset with a colleague, always use .rds and readRDS() / saveRDS(). This is because load() silently overwrites any object in your environment that has the same name as the object stored in the .rda file — a common source of difficult-to-debug errors. With readRDS(), you assign the loaded object to a name of your choosing, so there is no risk of collision.
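A short sketch of the no-collision property of readRDS(), using a temporary file and hypothetical object names:

```r
# An object from your own analysis
model_output <- "my own results"

# A colleague's object, saved as .rds
tmp <- tempfile(fileext = ".rds")
saveRDS("their results", tmp)

# With readRDS() you choose the name at load time,
# so your own model_output is untouched
model_output_colleague <- readRDS(tmp)

model_output             # "my own results"
model_output_colleague   # "their results"
unlink(tmp)
```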

RDS Files

RDS is the recommended format for storing a single R object — a data frame, a list, a fitted model, a character vector, or any other R object.

Code
# Load a serialised R object — assign to any name you like
# (this LADAL file was created with saveRDS(), so it is read
# with readRDS() despite its .rda extension)
rdadat <- readRDS(here::here("tutorials/load/data", "testdat.rda"))

str(rdadat)
'data.frame':   10 obs. of  2 variables:
 $ Variable1: num  6 65 12 56 45 84 38 46 64 24
 $ Variable2: num  67 16 56 34 54 42 36 47 54 29
Code
head(rdadat)
  Variable1 Variable2
1         6        67
2        65        16
3        12        56
4        56        34
5        45        54
6        84        42
Code
# Save any R object as RDS
saveRDS(
  object   = rdadat,
  file     = here::here("tutorials/load/data", "testdat_out.rds"),
  compress = TRUE   # default; gzip compression
)

# Compare compression options
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_xz.rds"),
        compress = "xz")      # smallest file, slowest
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_gz.rds"),
        compress = "gzip")    # fastest; largest of the three
saveRDS(rdadat, here::here("tutorials/load/data", "testdat_bz2.rds"),
        compress = "bzip2")   # middle ground for size and speed

RData Files

.rda / .RData files can store multiple named R objects in a single file. They are useful for bundling related objects together (e.g. a dataset, its metadata, and a pre-fitted model) or for distributing example data with an R package.

Code
# load() places objects directly into the current environment
# and invisibly returns their names
# (example commented out; it loads the multiple_objects.rda file
# created in the next code chunk)
# obj_names <- load(here::here("tutorials/load/data", "multiple_objects.rda"))
# cat("Objects loaded:", paste(obj_names, collapse = ", "), "\n")
Code
# Save multiple objects into one .rda file
x     <- 1:10
y     <- letters[1:5]
my_df <- data.frame(a = 1:3, b = c("x", "y", "z"))

save(x, y, my_df,
     file = here::here("tutorials/load/data", "multiple_objects.rda"))

# To save ALL objects in the current environment (use sparingly)
save.image(file = here::here("tutorials/load/data", "session_snapshot.RData"))
Avoid save.image() for Reproducibility

Saving your entire workspace with save.image() or allowing RStudio to save .RData on exit feels convenient but actively harms reproducibility. Your analysis can only be reproduced if it runs from scratch on clean data — not from a cached state that may contain objects whose provenance is unknown. Set Tools → Global Options → General → Workspace → “Never” for “Save workspace to .RData on exit” in RStudio.

Loading R Data from the Web

R native objects can be loaded directly from a URL without downloading the file first. This is the standard approach for LADAL tutorial data:

Code
# Load an RDS object directly from a URL
webdat <- base::readRDS(url("https://ladal.edu.au/tutorials/load/data/testdat.rda", "rb"))

# Equivalently, for a file on GitHub or any web server:
# webdat <- readRDS(url("https://raw.githubusercontent.com/.../testdat.rda", "rb"))
Code
# CSV from URL (readr handles URLs directly)
web_csv <- readr::read_csv(
  "https://raw.githubusercontent.com/LADAL/data/main/testdat.csv",
  col_types = cols()
)

# Excel from URL (must download to temp file first)
tmp <- tempfile(fileext = ".xlsx")
download.file("https://example.com/testdat.xlsx", destfile = tmp, mode = "wb")
web_xlsx <- readxl::read_excel(tmp)
unlink(tmp)   # delete the temporary file

A colleague sends you an .rda file called results.rda and tells you it contains an object called model_output. You run load("results.rda") in your R session. You already have an object called model_output in your environment from your own analysis. What happens?

  a) R produces an error and does not load the file
  b) R creates a second object called model_output_1 to avoid the conflict
  c) R silently overwrites your existing model_output with the colleague’s version, with no warning
  d) R asks you to confirm before overwriting the existing object
Answer

c) R silently overwrites your existing model_output with the colleague’s version, with no warning

This is one of the most dangerous behaviours of load(). It inserts objects directly into the global environment without checking for name conflicts. Your own model_output will be gone, with no undo. This is why saveRDS() / readRDS() are preferred for data exchange: with readRDS(), you write model_output_colleague <- readRDS("results.rda") and choose the name yourself, so no collision is possible.
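The silent overwrite is easy to reproduce with a temporary file (self-contained sketch):

```r
tmp <- tempfile(fileext = ".rda")

# Simulate the colleague's file
model_output <- "colleague's version"
save(model_output, file = tmp)

# Your own analysis later creates an object with the same name
model_output <- "my own version"

# load() restores the saved object silently: no warning, no prompt
load(tmp)
model_output   # "colleague's version"; your own version is gone
unlink(tmp)
```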


Loading and Saving JSON and XML

Section Overview

What you will learn: What JSON and XML are and where linguists encounter them; how to parse both formats into R data frames using jsonlite and xml2; and how to write R data back to these formats

JSON

JSON (JavaScript Object Notation) is the dominant data exchange format for web APIs, annotation tools, and many corpus management systems. It represents data as nested key-value pairs and arrays. Linguists encounter JSON when:

  • Downloading corpus metadata or concordances from a web API (e.g. CLARIN VLO, AntConc, SketchEngine)
  • Working with annotation exports from tools like CATMA, INCEpTION, or Label Studio
  • Reading metadata from language resource repositories (e.g. Glottolog, WALS online API)

Understanding JSON Structure

A simple JSON file looks like this:

{
  "participants": [
    {"id": "P01", "age": 24, "l1": "English", "proficiency": "Advanced"},
    {"id": "P02", "age": 31, "l1": "German",  "proficiency": "Intermediate"},
    {"id": "P03", "age": 28, "l1": "French",  "proficiency": "Advanced"}
  ],
  "study": "L2 Amplifier Use",
  "year": 2026
}

The outer {} is an object (key-value pairs). Square brackets [] denote arrays (ordered lists). Values can be strings, numbers, booleans, null, objects, or arrays — JSON is recursive.

Loading JSON

Code
json_string <- '{
  "participants": [
    {"id": "P01", "age": 24, "l1": "English",  "proficiency": "Advanced"},
    {"id": "P02", "age": 31, "l1": "German",   "proficiency": "Intermediate"},
    {"id": "P03", "age": 28, "l1": "French",   "proficiency": "Advanced"},
    {"id": "P04", "age": 22, "l1": "Japanese", "proficiency": "Intermediate"},
    {"id": "P05", "age": 35, "l1": "Spanish",  "proficiency": "Advanced"}
  ],
  "study": "L2 Amplifier Use",
  "year": 2026
}'

# Parse JSON string into an R list
json_list <- jsonlite::fromJSON(json_string, simplifyDataFrame = TRUE)

# The top-level keys become list elements
names(json_list)
[1] "participants" "study"        "year"        
Code
# The "participants" element is automatically converted to a data frame
participants <- json_list$participants
str(participants)
'data.frame':   5 obs. of  4 variables:
 $ id         : chr  "P01" "P02" "P03" "P04" ...
 $ age        : int  24 31 28 22 35
 $ l1         : chr  "English" "German" "French" "Japanese" ...
 $ proficiency: chr  "Advanced" "Intermediate" "Advanced" "Intermediate" ...
Code
participants
   id age       l1  proficiency
1 P01  24  English     Advanced
2 P02  31   German Intermediate
3 P03  28   French     Advanced
4 P04  22 Japanese Intermediate
5 P05  35  Spanish     Advanced
Code
# Load from a local file
json_data <- jsonlite::fromJSON(
  txt               = here::here("tutorials/load/data", "data.json"),
  simplifyDataFrame = TRUE,   # convert arrays of objects to data frames
  simplifyVector    = TRUE,   # convert scalar arrays to vectors
  flatten           = TRUE    # flatten nested objects into columns
)

# Load from a URL (e.g. a web API)
glottolog_url <- "https://glottolog.org/resource/languoid/id/stan1293.json"
# glottolog_data <- jsonlite::fromJSON(glottolog_url)
simplifyDataFrame = TRUE vs. FALSE

When simplifyDataFrame = TRUE (the default), fromJSON() tries to convert JSON arrays whose elements all have the same keys into a data frame. This is usually what you want. When the JSON structure is irregular (different keys in different elements), set simplifyDataFrame = FALSE to get a pure R list and then reshape manually.
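The difference is easiest to see on a deliberately irregular example (hypothetical records; requires jsonlite):

```r
library(jsonlite)

# Irregular records: the two objects do not share the same keys
irregular <- '[{"id": "P01", "age": 24}, {"id": "P02", "l1": "German"}]'

# As a pure list: each element keeps exactly its own keys
as_list <- jsonlite::fromJSON(irregular, simplifyDataFrame = FALSE)
names(as_list[[2]])   # "id" "l1"

# As a data frame: the union of keys becomes columns, gaps become NA
as_df <- jsonlite::fromJSON(irregular, simplifyDataFrame = TRUE)
as_df
```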

Handling Nested JSON

Real JSON from APIs is often deeply nested. The flatten = TRUE argument and tidyr::unnest() are your main tools:

Code
nested_json <- '{
  "corpus": [
    {
      "text_id": "T001",
      "metadata": {"genre": "academic", "year": 2010, "wordcount": 3241},
      "tokens": 3241
    },
    {
      "text_id": "T002",
      "metadata": {"genre": "fiction", "year": 2015, "wordcount": 8754},
      "tokens": 8754
    },
    {
      "text_id": "T003",
      "metadata": {"genre": "news", "year": 2019, "wordcount": 512},
      "tokens": 512
    }
  ]
}'

# flatten = TRUE unpacks nested objects into dot-separated column names
corpus_df <- jsonlite::fromJSON(
  nested_json, simplifyDataFrame = TRUE, flatten = TRUE
)$corpus
str(corpus_df)
'data.frame':   3 obs. of  5 variables:
 $ text_id           : chr  "T001" "T002" "T003"
 $ tokens            : int  3241 8754 512
 $ metadata.genre    : chr  "academic" "fiction" "news"
 $ metadata.year     : int  2010 2015 2019
 $ metadata.wordcount: int  3241 8754 512
Code
corpus_df
  text_id tokens metadata.genre metadata.year metadata.wordcount
1    T001   3241       academic          2010               3241
2    T002   8754        fiction          2015               8754
3    T003    512           news          2019                512
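flatten = TRUE handles nested objects; when the nested element is instead an array of values, fromJSON() produces a list-column, which tidyr::unnest() expands into long format. A sketch with hypothetical keyword data (requires jsonlite and tidyr):

```r
library(jsonlite)
library(tidyr)

j <- '[
  {"text_id": "T001", "keywords": ["corpus", "analysis"]},
  {"text_id": "T002", "keywords": ["fiction"]}
]'

df <- jsonlite::fromJSON(j)      # keywords is a list-column
long <- tidyr::unnest(df, keywords)

long   # one row per text_id/keyword pair
```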

Saving JSON

Code
# Convert an R object to a JSON string
json_out <- jsonlite::toJSON(
  participants,
  pretty     = TRUE,   # indented, human-readable output
  auto_unbox = TRUE    # single-element arrays written as scalars
)
cat(json_out)

# Write to file
jsonlite::write_json(
  participants,
  path       = here::here("tutorials/load/data", "participants_out.json"),
  pretty     = TRUE,
  auto_unbox = TRUE
)

XML

XML (eXtensible Markup Language) is older than JSON and more verbose, but it remains the dominant format in computational linguistics and digital humanities. Linguists encounter XML in:

  • TEI (Text Encoding Initiative) markup for edited texts, manuscripts, and historical corpora
  • CoNLL-U and related annotation formats (sometimes XML-wrapped)
  • BNC, BNC2014, COCA corpus XML distributions
  • ELAN annotation files (.eaf)
  • Sketch Engine CQL export format

Understanding XML Structure

XML organises data as a tree of nested elements, each with an opening tag, a closing tag, and optionally attributes and text content:

<?xml version="1.0" encoding="UTF-8"?>
<corpus name="MiniCorpus" year="2026">
  <text id="T001" genre="academic">
    <sentence n="1">
      <token pos="NN" lemma="corpus">corpus</token>
      <token pos="NN" lemma="analysis">analysis</token>
    </sentence>
  </text>
</corpus>

Loading XML

Code
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<corpus name="MiniCorpus" year="2026">
  <text id="T001" genre="academic">
    <sentence n="1">
      <token pos="DT">The</token>
      <token pos="NN">corpus</token>
      <token pos="VBZ">contains</token>
      <token pos="JJ">linguistic</token>
      <token pos="NNS">tokens</token>
    </sentence>
    <sentence n="2">
      <token pos="NNS">Frequencies</token>
      <token pos="VBP">vary</token>
      <token pos="IN">by</token>
      <token pos="NN">genre</token>
    </sentence>
  </text>
  <text id="T002" genre="fiction">
    <sentence n="1">
      <token pos="PRP">She</token>
      <token pos="VBD">said</token>
      <token pos="RB">very</token>
      <token pos="RB">little</token>
    </sentence>
  </text>
</corpus>'

xml_doc <- xml2::read_xml(xml_string)

# Extract all token elements
tokens_nodeset <- xml2::xml_find_all(xml_doc, ".//token")

token_df <- data.frame(
  text_id = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::text[1]"), "id"),
  genre   = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::text[1]"), "genre"),
  sent_n  = xml2::xml_attr(
    xml2::xml_find_first(tokens_nodeset, "./ancestor::sentence[1]"), "n"),
  pos     = xml2::xml_attr(tokens_nodeset, "pos"),
  word    = xml2::xml_text(tokens_nodeset),
  stringsAsFactors = FALSE
)

head(token_df, 10)
   text_id    genre sent_n pos        word
1     T001 academic      1  DT         The
2     T001 academic      1  NN      corpus
3     T001 academic      1 VBZ    contains
4     T001 academic      1  JJ  linguistic
5     T001 academic      1 NNS      tokens
6     T001 academic      2 NNS Frequencies
7     T001 academic      2 VBP        vary
8     T001 academic      2  IN          by
9     T001 academic      2  NN       genre
10    T002  fiction      1 PRP         She
XPath: The Language of XML Navigation

XPath is a mini-language for selecting nodes from an XML tree. The most useful patterns are:

Common XPath patterns for corpus XML
XPath expression            Meaning
//token                     All <token> elements anywhere in the document
.//token                    All <token> elements within the current context node
//text[@genre='academic']   <text> elements whose genre attribute equals "academic"
//sentence[@n='1']//token   All tokens inside sentences with n="1"
//token/@pos                The pos attribute of every <token> element

Always test XPath expressions with xml2::xml_find_all() and inspect the result before building a full extraction pipeline.
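As a quick check, the patterns above can be tried out on a self-contained miniature of the corpus used earlier (the structure mirrors the Loading XML example, so the counts are easy to verify by eye):

```r
library(xml2)

mini <- read_xml('<corpus>
  <text id="T001" genre="academic">
    <sentence n="1"><token pos="DT">The</token><token pos="NN">corpus</token></sentence>
  </text>
  <text id="T002" genre="fiction">
    <sentence n="1"><token pos="PRP">She</token></sentence>
  </text>
</corpus>')

# //token: every token anywhere in the document
length(xml_find_all(mini, "//token"))                    # 3

# //text[@genre='academic']: filter on an attribute value
length(xml_find_all(mini, "//text[@genre='academic']"))  # 1

# //sentence[@n='1']//token: tokens in any sentence numbered 1
length(xml_find_all(mini, "//sentence[@n='1']//token"))  # 3

# attribute values are easiest to pull with xml_attr()
xml_attr(xml_find_all(mini, "//token"), "pos")           # "DT" "NN" "PRP"
```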

A More Efficient XML Extraction Pattern

Code
# Extract all texts with their metadata, using purrr::map_dfr
corpus_table <- purrr::map_dfr(
  xml2::xml_find_all(xml_doc, ".//text"),
  function(text_node) {
    text_id <- xml2::xml_attr(text_node, "id")
    genre   <- xml2::xml_attr(text_node, "genre")
    tokens  <- xml2::xml_find_all(text_node, ".//token")
    data.frame(
      text_id = text_id,
      genre   = genre,
      pos     = xml2::xml_attr(tokens, "pos"),
      word    = xml2::xml_text(tokens),
      stringsAsFactors = FALSE
    )
  }
)

corpus_table
   text_id    genre pos        word
1     T001 academic  DT         The
2     T001 academic  NN      corpus
3     T001 academic VBZ    contains
4     T001 academic  JJ  linguistic
5     T001 academic NNS      tokens
6     T001 academic NNS Frequencies
7     T001 academic VBP        vary
8     T001 academic  IN          by
9     T001 academic  NN       genre
10    T002  fiction PRP         She
11    T002  fiction VBD        said
12    T002  fiction  RB        very
13    T002  fiction  RB      little

Saving XML

Code
# Build an XML document from scratch
new_xml    <- xml2::xml_new_root("corpus", name = "OutputCorpus", year = "2026")
text_node  <- xml2::xml_add_child(new_xml, "text", id = "T001", genre = "academic")
sent_node  <- xml2::xml_add_child(text_node, "sentence", n = "1")
xml2::xml_add_child(sent_node, "token", pos = "NN", "analysis")
xml2::xml_add_child(sent_node, "token", pos = "VBZ", "requires")
xml2::xml_add_child(sent_node, "token", pos = "NN", "data")

xml2::write_xml(new_xml,
  file     = here::here("tutorials/load/data", "output_corpus.xml"),
  encoding = "UTF-8"
)
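To confirm that a document built this way round-trips cleanly, here is a minimal sketch that writes to a temporary file instead of the data directory, then reads the file back:

```r
# Build a one-token document, write it out, and read it back
doc  <- xml2::xml_new_root("corpus", name = "OutputCorpus")
sent <- xml2::xml_add_child(doc, "sentence", n = "1")
xml2::xml_add_child(sent, "token", pos = "NN", "analysis")

tmp <- tempfile(fileext = ".xml")
xml2::write_xml(doc, tmp)

reloaded <- xml2::read_xml(tmp)
xml2::xml_attr(reloaded, "name")                            # "OutputCorpus"
xml2::xml_text(xml2::xml_find_first(reloaded, ".//token"))  # "analysis"
```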

You receive a TEI-encoded corpus as an XML file. You want to extract all <w> (word) elements that have a pos attribute of "VBZ". Which XPath expression is correct?

  1. //w[pos='VBZ']
  2. //w[@pos='VBZ']
  3. //w.pos='VBZ'
  4. //w[text()='VBZ']
Answer

b) //w[@pos='VBZ']

In XPath, attributes are referenced with the @ prefix inside square brackets. Option (a) is incorrect because without @, pos refers to a child element named pos, not an attribute. Option (c) is not valid XPath syntax. Option (d) selects <w> elements whose text content is "VBZ" — matching words literally spelled “VBZ”, not words tagged as VBZ.
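The difference is easy to verify on a tiny hand-made snippet (hypothetical, not taken from any real TEI file), where one word is tagged VBZ and another word is literally spelled "VBZ":

```r
library(xml2)

snippet <- read_xml('<s><w pos="VBZ">contains</w><w pos="NN">VBZ</w></s>')

length(xml_find_all(snippet, "//w[@pos='VBZ']"))   # 1 - the word tagged VBZ
length(xml_find_all(snippet, "//w[pos='VBZ']"))    # 0 - no <pos> child elements exist
length(xml_find_all(snippet, "//w[text()='VBZ']")) # 1 - the word spelled "VBZ"
```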


Loading Built-In and Package Datasets

Section Overview

What you will learn: How to access datasets built into base R and into installed packages; how to find and browse available datasets; and how to use them as starting points for examples and practice

Base R Datasets

R ships with a large collection of built-in datasets that are immediately available without downloading anything. For linguists, they provide convenient practice data and well-documented benchmarks.

Code
# List datasets across all installed packages (first 20)
all_datasets <- data(package = .packages(all.available = TRUE))$results |>
  as.data.frame() |>
  dplyr::select(Package, Item, Title) |>
  head(20)

all_datasets
     Package                            Item
1  data.tree                            acme
2  data.tree                        mushroom
3      dplyr                band_instruments
4      dplyr               band_instruments2
5      dplyr                    band_members
6      dplyr                        starwars
7      dplyr                          storms
8    ggplot2                        diamonds
9    ggplot2                       economics
10   ggplot2                  economics_long
11   ggplot2                       faithfuld
12   ggplot2                     luv_colours
13   ggplot2                         midwest
14   ggplot2                             mpg
15   ggplot2                          msleep
16   ggplot2                    presidential
17   ggplot2                           seals
18   ggplot2                       txhousing
19  openxlsx     openxlsxFontSizeLookupTable
20  openxlsx openxlsxFontSizeLookupTableBold
                                                               Title
1                     Sample Data: A Simple Company with Departments
2                         Sample Data: Data Used by the ID3 Vignette
3                                                    Band membership
4                                                    Band membership
5                                                    Band membership
6                                                Starwars characters
7                                                  Storm tracks data
8                           Prices of over 50,000 round cut diamonds
9                                            US economic time series
10                                           US economic time series
11                          2d density estimate of Old Faithful data
12                                           'colors()' in Luv space
13                                              Midwest demographics
14 Fuel economy data from 1999 to 2008 for 38 popular models of cars
15      An updated and expanded version of the mammals sleep dataset
16                   Terms of 12 presidents from Eisenhower to Trump
17                                    Vector field of seal movements
18                                               Housing sales in TX
19                                           Font Size Lookup tables
20                                           Font Size Lookup tables
Code
# Load built-in datasets by name (no file path needed)
data("iris")       # Fisher's iris measurements — classic ML benchmark
data("mtcars")     # Motor Trend car road tests — classic regression example

# Note: letters and LETTERS are built-in constants, not datasets,
# so no data() call is needed; they are always available
letters    # the 26 lowercase letters
LETTERS    # the 26 uppercase letters
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Linguistics-Relevant Package Datasets

Code
# English letter frequencies (approximate, from standard references)
letter_freq <- data.frame(
  letter    = letters,
  frequency = c(8.2,1.5,2.8,4.3,12.7,2.2,2.0,6.1,7.0,0.15,
                0.77,4.0,2.4,6.7,7.5,1.9,0.10,6.0,6.3,9.1,
                2.8,0.98,2.4,0.15,2.0,0.074)
)
letter_freq |>
  dplyr::arrange(desc(frequency)) |>
  head(10)
   letter frequency
1       e      12.7
2       t       9.1
3       a       8.2
4       o       7.5
5       i       7.0
6       n       6.7
7       s       6.3
8       h       6.1
9       r       6.0
10      d       4.3
Code
# The 'languageR' package contains many linguistic datasets
# (install if needed: install.packages("languageR"))
# data("english",    package = "languageR")    # English lexical decision data
# data("regularity", package = "languageR")    # Morphological regularity
# data("ratings",    package = "languageR")    # Word familiarity ratings

# The 'corpora' package
# data("BNCcomma", package = "corpora")        # BNC frequency data
Finding Datasets in a Package

Code
data(package = "datasets")     # list all datasets in the datasets package
data(package = "languageR")    # list all datasets in languageR

?iris           # documentation for a built-in dataset
nrow(iris); ncol(iris); names(iris)

You want to practice loading data without downloading any files. Which command correctly loads a built-in R dataset for immediate use?

  1. read.csv("iris") — reads the iris dataset from a CSV file in the working directory
  2. data("iris") — loads the iris dataset into the global environment from the datasets package
  3. load("iris.rda") — loads an RDA file called iris.rda from the working directory
  4. readRDS("iris") — loads an RDS object named “iris” from the working directory
Answer

b) data("iris")

The data() function loads built-in datasets from R packages into the current environment. No file path is needed. Options (a), (c), and (d) all assume the data exists as a file on disk, which it does not for built-in datasets.


Loading and Saving Unstructured Text Data

Section Overview

What you will learn: How to load single plain-text files into R as word vectors or line vectors; how to load an entire directory of text files into a named list; how to read content from Word (.docx) documents; and how to save text data back to disk

Single Text Files

Corpus linguists routinely work with raw text stored in plain-text (.txt) files. Three functions, two from base R and one from the readr package, cover most use cases:

Functions for loading plain-text files
Function              Returns                                Best for
scan(what = "char")   Character vector of individual words   Token-level analysis, word counts
readLines()           Character vector of lines              Sentence/line-level analysis, concordancing
readr::read_file()    Single character string                Full-text manipulation, regex over the entire document
Code
# scan(): reads tokens (whitespace-separated), returns a character vector
testtxt_words <- scan(
  here::here("tutorials/load/data", "english.txt"),
  what  = "char",
  quiet = TRUE    # suppress "Read N items" message
)

cat("Total tokens:", length(testtxt_words), "\n")
Total tokens: 21 
Code
head(testtxt_words, 20)
 [1] "Linguistics" "is"          "the"         "scientific"  "study"      
 [6] "of"          "language"    "and"         "it"          "involves"   
[11] "the"         "analysis"    "of"          "language"    "form,"      
[16] "language"    "meaning,"    "and"         "language"    "in"         
Code
# readLines(): reads complete lines, returns a character vector
testtxt_lines <- readLines(
  con      = here::here("tutorials/load/data", "english.txt"),
  encoding = "UTF-8",
  warn     = FALSE
)

cat("Total lines:", length(testtxt_lines), "\n")
Total lines: 1 
Code
head(testtxt_lines, 5)
[1] "Linguistics is the scientific study of language and it involves the analysis of language form, language meaning, and language in context. "
Code
# readr::read_file(): loads the entire file as one string
testtxt_full <- readr::read_file(
  here::here("tutorials/load/data", "english.txt")
)

cat("Character count:", nchar(testtxt_full), "\n")

# Apply regex to the full text — e.g. extract sentences ending with ?
questions <- stringr::str_extract_all(testtxt_full, "[A-Z][^.!?]*\\?")[[1]]
Encoding and Non-ASCII Characters

Always specify encoding = "UTF-8" when reading files that may contain non-ASCII characters (accented letters, IPA symbols, non-Latin scripts). If readLines() throws a warning about invalid multibyte strings, the file encoding may be Latin-1 or Windows-1252.

# f is the path to a file known (or suspected) to be Latin-1 encoded
raw_text <- readLines(f, encoding = "latin1")
utf_text <- iconv(raw_text, from = "latin1", to = "UTF-8")
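If you do not know the source encoding at all, readr::guess_encoding() inspects the raw bytes and ranks candidate encodings by confidence. A self-contained sketch (the Latin-1 file is created in a temporary location purely for illustration):

```r
# Write a small Latin-1 encoded file to a temporary path
f <- tempfile(fileext = ".txt")
writeLines(iconv("caf\u00e9 d\u00e9j\u00e0 vu", to = "latin1"), f, useBytes = TRUE)

# Candidate encodings, ranked by confidence
readr::guess_encoding(f)

# Convert using the detected (here: known) encoding
raw_text <- readLines(f, warn = FALSE)
utf_text <- iconv(raw_text, from = "latin1", to = "UTF-8")
```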

Saving Single Text Files

Code
# writeLines(): write a character vector (one element per line)
writeLines(
  text     = testtxt_lines,
  con      = here::here("tutorials/load/data", "english_out.txt"),
  useBytes = FALSE
)

# write_file(): write a single character string
readr::write_file(
  x    = testtxt_full,
  file = here::here("tutorials/load/data", "english_out2.txt")
)

Loading Multiple Text Files

When working with corpora, you will often need to load many text files at once and store them in a named character vector (or list), one element per file. The recommended approach uses list.files() to discover the files and purrr::map_chr() to load them:

Code
# Step 1: get all file paths
fls <- list.files(
  path       = here::here("tutorials/load/data", "testcorpus"),
  pattern    = "\\.txt$",
  full.names = TRUE
)

cat("Files found:", length(fls), "\n")
Files found: 7 
Code
basename(fls)
[1] "linguistics01.txt" "linguistics02.txt" "linguistics03.txt"
[4] "linguistics04.txt" "linguistics05.txt" "linguistics06.txt"
[7] "linguistics07.txt"
Code
# Helper: read one file safely, with UTF-8 fallback
read_txt_safe <- function(f) {
  txt <- tryCatch(
    readLines(f, encoding = "UTF-8", warn = FALSE),
    error = function(e) readLines(f, encoding = "latin1", warn = FALSE)
  )
  txt <- iconv(txt, from = "", to = "UTF-8", sub = "byte")
  paste(txt, collapse = " ")
}

# purrr::map_chr: one string per file
txts_purrr <- purrr::map_chr(fls, read_txt_safe)
names(txts_purrr) <- tools::file_path_sans_ext(basename(fls))

cat("Texts loaded:", length(txts_purrr), "\n")
Texts loaded: 7 
Code
print(nchar(txts_purrr))
linguistics01 linguistics02 linguistics03 linguistics04 linguistics05 
          946           523           751           673           898 
linguistics06 linguistics07 
         1172           496 
Code
# Build a corpus data frame: one row per text
corpus_df <- data.frame(
  file     = tools::file_path_sans_ext(basename(fls)),
  text     = txts_purrr,
  n_tokens = sapply(strsplit(txts_purrr, "\\s+"), length),
  n_chars  = nchar(txts_purrr),
  stringsAsFactors = FALSE,
  row.names = NULL
)

corpus_df
           file
1 linguistics01
2 linguistics02
3 linguistics03
4 linguistics04
5 linguistics05
6 linguistics06
7 linguistics07
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  text
1                                                                                                                                                                                                                                   Linguistics is the scientific study of language. It involves analysing language form language meaning and language in context. The earliest activities in the documentation and description of language have been attributed to the th-century-BC Indian grammarian Pa?ini who wrote a formal description of the Sanskrit language in his A??adhyayi.  Linguists traditionally analyse human language by observing an interplay between sound and meaning. Phonetics is the study of speech and non-speech sounds and delves into their acoustic and articulatory properties. The study of language meaning on the other hand deals with how languages encode relations between entities properties and other aspects of the world to convey process and assign meaning as well as manage and resolve ambiguity. While the study of semantics typically concerns itself with truth conditions pragmatics deals with how situational context influences the production of meaning. 
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Grammar is a system of rules which governs the production and use of utterances in a given language. These rules apply to sound as well as meaning, and include componential subsets of rules, such as those pertaining to phonology (the organisation of phonetic sound systems), morphology (the formation and composition of words), and syntax (the formation and composition of phrases and sentences). Many modern theories that deal with the principles of grammar are based on Noam Chomsky's framework of generative linguistics.
3                                                                                                                                                                                                                                                                                                                                                                                                                                      In the early 20th century, Ferdinand de Saussure distinguished between the notions of langue and parole in his formulation of structural linguistics. According to him, parole is the specific utterance of speech, whereas langue refers to an abstract phenomenon that theoretically defines the principles and system of rules that govern a language. This distinction resembles the one made by Noam Chomsky between competence and performance in his theory of transformative or generative grammar. According to Chomsky, competence is an individual's innate capacity and potential for language (like in Saussure's langue), while performance is the specific way in which it is used by individuals, groups, and communities (i.e., parole, in Saussurean terms). 
4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The study of parole (which manifests through cultural discourses and dialects) is the domain of sociolinguistics, the sub-discipline that comprises the study of a complex system of linguistic facets within a certain speech community (governed by its own set of grammatical rules and laws). Discourse analysis further examines the structure of texts and conversations emerging out of a speech community's usage of language. This is done through the collection of linguistic data, or through the formal discipline of corpus linguistics, which takes naturally occurring texts and studies the variation of grammatical and other features based on such corpora (or corpus data). 
5                                                                                                                                                                                                                                                                                   Stylistics also involves the study of written, signed, or spoken discourse through varying speech communities, genres, and editorial or narrative formats in the mass media. In the 1960s, Jacques Derrida, for instance, further distinguished between speech and writing, by proposing that written language be studied as a linguistic medium of communication in itself. Palaeography is therefore the discipline that studies the evolution of written scripts (as signs and symbols) in language. The formal study of language also led to the growth of fields like psycholinguistics, which explores the representation and function of language in the mind; neurolinguistics, which studies language processing in the brain; biolinguistics, which studies the biology and evolution of language; and language acquisition, which investigates how children and adults acquire the knowledge of one or more languages. 
6 Linguistics also deals with the social, cultural, historical and political factors that influence language, through which linguistic and language-based context is often determined. Research on language through the sub-branches of historical and evolutionary linguistics also focus on how languages change and grow, particularly over an extended period of time.  Language documentation combines anthropological inquiry (into the history and culture of language) with linguistic inquiry, in order to describe languages and their grammars. Lexicography involves the documentation of words that form a vocabulary. Such a documentation of a linguistic vocabulary from a particular language is usually compiled in a dictionary. Computational linguistics is concerned with the statistical or rule-based modeling of natural language from a computational perspective. Specific knowledge of language is applied by speakers during the act of translation and interpretation, as well as in language education <96> the teaching of a second or foreign language. Policy makers work with governments to implement new plans in education and teaching which are based on linguistic research. 
7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Related areas of study also includes the disciplines of semiotics (the study of direct and indirect language through signs and symbols), literary criticism (the historical and ideological analysis of literature, cinema, art, or published material), translation (the conversion and documentation of meaning in written/spoken text from one language or dialect onto another), and speech-language pathology (a corrective method to cure phonetic disabilities and dis-functions at the cognitive level).
  n_tokens n_chars
1      138     946
2       81     523
3      111     751
4      101     673
5      130     898
6      165    1172
7       68     496

Saving Multiple Text Files

Code
out_dir <- here::here("tutorials/load/data", "testcorpus_out")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

out_paths <- file.path(out_dir, paste0(names(txts_purrr), ".txt"))

purrr::walk2(txts_purrr, out_paths, ~ writeLines(.x, con = .y))
cat("Saved", length(out_paths), "files.\n")
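The same walk2() pattern round-trips cleanly. A self-contained sketch using toy texts and a temporary directory (rather than the corpus and paths above):

```r
# Toy corpus: two named texts
txts <- c(text_a = "first toy text", text_b = "second toy text")

tmp_dir <- file.path(tempdir(), "toycorpus_out")
dir.create(tmp_dir, showWarnings = FALSE, recursive = TRUE)
paths <- file.path(tmp_dir, paste0(names(txts), ".txt"))

# Write each text to its own file...
purrr::walk2(txts, paths, ~ writeLines(.x, con = .y))

# ...then read them back to confirm nothing was altered
reloaded <- purrr::map_chr(paths, ~ paste(readLines(.x, warn = FALSE), collapse = " "))
identical(unname(txts), reloaded)   # TRUE
```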

Loading Word Documents

Interview transcripts, annotated texts, and survey instruments are often stored as Microsoft Word .docx files. The officer package reads a .docx file with read_docx(); its docx_summary() function then returns a structured data frame in which each paragraph, heading, and table cell is a separate row.

Code
doc_object <- officer::read_docx(here::here("tutorials/load/data", "mydoc.docx"))
content    <- officer::docx_summary(doc_object)

str(content)
'data.frame':   38 obs. of  11 variables:
 $ doc_index      : int  1 2 3 4 5 6 8 9 10 11 ...
 $ content_type   : chr  "paragraph" "paragraph" "paragraph" "paragraph" ...
 $ style_name     : chr  NA NA NA NA ...
 $ text           : chr  "HYPERLINK \"https://en.wikipedia.org/wiki/Main_Page\"" "Language technology" "From Wikipedia, the free encyclopedia" "Language technology, often called human language technology (HLT), studies methods of how computer programs or "| __truncated__ ...
 $ table_index    : int  NA NA NA NA NA NA NA NA NA NA ...
 $ row_id         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ cell_id        : int  NA NA NA NA NA NA NA NA NA NA ...
 $ is_header      : logi  NA NA NA NA NA NA ...
 $ row_span       : int  NA NA NA NA NA NA NA NA NA NA ...
 $ col_span       : chr  NA NA NA NA ...
 $ table_stylename: chr  NA NA NA NA ...
Code
head(content, 15)
   doc_index content_type style_name
1          1    paragraph       <NA>
2          2    paragraph       <NA>
3          3    paragraph       <NA>
4          4    paragraph       <NA>
5          5    paragraph       <NA>
6          6    paragraph       <NA>
7          8    paragraph       <NA>
8          9    paragraph       <NA>
9         10    paragraph       <NA>
10        11    paragraph       <NA>
11        12    paragraph       <NA>
12        13    paragraph       <NA>
13        14    paragraph       <NA>
14        15    paragraph       <NA>
15        16    paragraph       <NA>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         text
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         HYPERLINK "https://en.wikipedia.org/wiki/Main_Page"
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Language technology
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       From Wikipedia, the free encyclopedia
4  Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech.[1] Working with language technology often requires broad knowledge not only about linguistics but also about computer science. It consists of natural language processing (NLP) and computational linguistics (CL) on the one hand, many application oriented aspects of these, and more low-level aspects such as encoding and speech technology on the other hand. 
5  Note that these elementary aspects are normally not considered to be within the scope of related terms such as natural language processing and (applied) computational linguistics, which are otherwise near-synonyms. As an example, for many of the world's lesser known languages, the foundation of language technology is providing communities with fonts and keyboard setups so their languages can be written on computers or mobile devices.[2]
6  References
7  Uszkoreit, Hans. "DFKI-LT - What is Language Technology". Retrieved 16 November 2018.
8  "SIL Writing Systems Technology". sil.org. 11 December 2018. Retrieved 9 December 2019.
9  External links
10 Johns Hopkins University Human Language Technology Center of Excellence
11 Carnegie Mellon University Language Technologies Institute
12 Institute for Applied Linguistics (IULA) at Universitat Pompeu Fabra. Barcelona, Spain
13 German Research Centre for Artificial Intelligence (DFKI) Language Technology Lab
14 CLT: Centre for Language Technology in Gothenburg, Sweden Archived 2017-04-10 at the Wayback Machine
15 The Center for Speech and Language Technologies (CSaLT) at the Lahore University [sic] of Management Sciences (LUMS)
(The table-related columns table_index, row_id, cell_id, is_header, row_span, col_span, and table_stylename are NA for all of these rows, since they are paragraphs rather than table cells.)
Code
# Filter to non-empty paragraphs
paragraphs <- content |>
  dplyr::filter(content_type == "paragraph",
                !is.na(text),
                nchar(trimws(text)) > 0) |>
  dplyr::select(style_name, text)

head(paragraphs, 10)
   style_name text
1        <NA> HYPERLINK "https://en.wikipedia.org/wiki/Main_Page"
2        <NA> Language technology
3        <NA> From Wikipedia, the free encyclopedia
4        <NA> Language technology, often called human language technology (HLT), studies methods of how computer programs or electronic devices can analyze, produce, modify or respond to human texts and speech.[1] Working with language technology often requires broad knowledge not only about linguistics but also about computer science. It consists of natural language processing (NLP) and computational linguistics (CL) on the one hand, many application oriented aspects of these, and more low-level aspects such as encoding and speech technology on the other hand.
5        <NA> Note that these elementary aspects are normally not considered to be within the scope of related terms such as natural language processing and (applied) computational linguistics, which are otherwise near-synonyms. As an example, for many of the world's lesser known languages, the foundation of language technology is providing communities with fonts and keyboard setups so their languages can be written on computers or mobile devices.[2]
6        <NA> References
7        <NA> Uszkoreit, Hans. "DFKI-LT - What is Language Technology". Retrieved 16 November 2018.
8        <NA> "SIL Writing Systems Technology". sil.org. 11 December 2018. Retrieved 9 December 2019.
9        <NA> External links
10       <NA> Johns Hopkins University Human Language Technology Center of Excellence
Code
# Extract only body text (style "Normal" in most templates)
body_text <- paragraphs |>
  dplyr::filter(style_name == "Normal") |>
  dplyr::pull(text) |>
  paste(collapse = " ")

cat("Body text (first 200 chars):\n", substr(body_text, 1, 200), "\n")
Body text (first 200 chars):
  
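The result here is empty because, in this converted document, style_name is NA for every paragraph, so filtering on "Normal" matches nothing. When style names are missing, one fallback is to treat style-less paragraphs as body text. Below is a minimal base-R sketch on a toy data frame whose style_name and text columns mirror the paragraphs object above (the toy values are illustrative, not from the document):

```r
# Toy data frame mirroring the paragraphs object above:
# body paragraphs often carry no explicit style (style_name is NA)
paragraphs <- data.frame(
  style_name = c(NA, NA, "heading 1"),
  text       = c("First sentence.", "Second sentence.", "Introduction"),
  stringsAsFactors = FALSE
)

# Treat paragraphs with no style (or the "Normal" style) as body text
keep <- is.na(paragraphs$style_name) | paragraphs$style_name == "Normal"
body_text <- paste(paragraphs$text[keep], collapse = " ")
body_text
```

The same condition, `is.na(style_name) | style_name == "Normal"`, can be dropped into the dplyr::filter() call above when working with the real paragraphs object.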
Extracting Headings from Word Documents

Headings are stored with style names such as "heading 1", "heading 2", and so on:

Code
headings <- content |>
  dplyr::filter(grepl("^heading", style_name, ignore.case = TRUE)) |>
  dplyr::select(style_name, text)

This is useful for segmenting interview transcripts by topic or speaker turn.
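The heading rows can then be used to split the remaining paragraphs into sections. Here is a minimal base-R sketch on a toy data frame with the same style_name and text columns; it assumes every body paragraph is preceded by at least one heading:

```r
# Toy paragraph summary: two headings, three body paragraphs
content <- data.frame(
  style_name = c("heading 1", "Normal", "Normal", "heading 1", "Normal"),
  text       = c("Intro", "First para.", "Second para.", "Methods", "Third para."),
  stringsAsFactors = FALSE
)

is_heading <- grepl("^heading", content$style_name, ignore.case = TRUE)
# Each paragraph is assigned to the most recent heading above it
section_id <- cumsum(is_heading)
sections   <- split(content$text[!is_heading], section_id[!is_heading])
# Assumes each heading is followed by at least one paragraph
names(sections) <- content$text[is_heading]
sections
```

The cumsum() trick gives every row a running section number that increments at each heading, so split() groups the body paragraphs by the heading they fall under.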

You want to load 50 interview transcripts stored as .txt files in a folder called transcripts/. You need a named character vector — one element per interview, named by file stem. Which code achieves this?

  a. txts <- readLines(here::here("transcripts"))
  b. fls  <- list.files(here::here("transcripts"), pattern = "\\.txt$", full.names = TRUE)
     txts <- purrr::map_chr(fls, readr::read_file)
     names(txts) <- tools::file_path_sans_ext(basename(fls))
  c. txts <- read.csv(here::here("transcripts"), header = FALSE)
  d. txts <- scan(here::here("transcripts"), what = "char", quiet = TRUE)
Answer

b) list.files() with full.names = TRUE returns complete paths. purrr::map_chr() applies readr::read_file() to each, returning a character vector of full texts. tools::file_path_sans_ext(basename(fls)) strips the path and .txt extension to produce clean names. Options (a), (c), and (d) are all incorrect: readLines() and scan() take a single file path, not a directory; read.csv() expects tabular data.
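For readers who prefer to avoid the purrr and readr dependencies, the same named-vector result can be produced with base R alone. A sketch (the commented folder path is illustrative):

```r
# Read every .txt file in a folder into a named character vector,
# one element per file, named by the file stem (base R only)
read_txt_folder <- function(dir) {
  fls  <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  txts <- vapply(fls,
                 function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                 character(1),
                 USE.NAMES = FALSE)
  names(txts) <- tools::file_path_sans_ext(basename(fls))
  txts
}

# Illustrative call; point this at your own transcripts/ folder
# txts <- read_txt_folder(here::here("transcripts"))
```

Note that readLines() returns one element per line, so the paste(collapse = "\n") step is what turns each file back into a single string, mirroring what readr::read_file() does in option (b).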


Citation and Session Info

Schweinberger, Martin. 2026. Loading and Saving Data in R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/load/load.html (Version 2026.05.01).

@manual{schweinberger2026loadr,
  author       = {Schweinberger, Martin},
  title        = {Loading and Saving Data in R},
  note         = {https://ladal.edu.au/tutorials/load/load.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.05.01}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggplot2_4.0.2    purrr_1.0.4      xml2_1.3.6       jsonlite_1.9.0  
 [5] writexl_1.5.1    readxl_1.4.3     readr_2.1.5      officer_0.7.3   
 [9] data.tree_1.1.0  here_1.0.2       openxlsx_4.2.8   flextable_0.9.11
[13] tidyr_1.3.2      stringr_1.5.1    dplyr_1.2.0     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.56               htmlwidgets_1.6.4      
 [4] lattice_0.22-6          tzdb_0.4.0              vctrs_0.7.1            
 [7] tools_4.4.2             generics_0.1.3          parallel_4.4.2         
[10] tibble_3.2.1            pkgconfig_2.0.3         Matrix_1.7-2           
[13] data.table_1.17.0       RColorBrewer_1.1-3      S7_0.2.1               
[16] uuid_1.2-1              lifecycle_1.0.5         compiler_4.4.2         
[19] farver_2.1.2            textshaping_1.0.0       codetools_0.2-20       
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.9        
[25] yaml_2.3.10             crayon_1.5.3            pillar_1.10.1          
[28] openssl_2.3.2           nlme_3.1-166            fontBitstreamVera_0.1.1
[31] tidyselect_1.2.1        zip_2.3.2               digest_0.6.39          
[34] stringi_1.8.4           splines_4.4.2           labeling_0.4.3         
[37] rprojroot_2.1.1         fastmap_1.2.0           grid_4.4.2             
[40] cli_3.6.4               magrittr_2.0.3          patchwork_1.3.0        
[43] utf8_1.2.4              withr_3.0.2             gdtools_0.5.0          
[46] scales_1.4.0            bit64_4.6.0-1           rmarkdown_2.30         
[49] bit_4.5.0.1             cellranger_1.1.0        askpass_1.2.1          
[52] ragg_1.3.3              hms_1.1.3               evaluate_1.0.3         
[55] knitr_1.51              mgcv_1.9-1              rlang_1.1.7            
[58] Rcpp_1.1.1              glue_1.8.0              renv_1.1.7             
[61] rstudioapi_0.17.1       vroom_1.6.5             R6_2.6.1               
[64] systemfonts_1.3.1      
AI Transparency Statement

This tutorial was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to substantially expand and restructure a shorter existing LADAL tutorial on loading and saving data. All content was reviewed and approved by the named author (Martin Schweinberger), who takes full responsibility for its accuracy.



